Abstract

This paper presents an approach to detect and track groups
of people in video-surveillance applications, and to
automatically recognize their behavior. The method keeps
track of individuals moving together by maintaining a
spatial and temporal group coherence. First, people are
individually detected and tracked. Second, their trajectories
are analyzed over a temporal window and clustered using
the Mean-Shift algorithm. A coherence value describes how
well a set of people can be described as a group.
Furthermore, we propose a formal event description language.
The group event recognition approach is successfully validated
on 4 camera views from 3 datasets: an airport, a subway, a
shopping center corridor and an entrance hall.



1. Introduction 

In the framework of a video understanding system
(figure 1), video sequences are abstracted into physical
objects: the objects of interest for a given application. The
physical objects are then used to recognize events. In this
paper, we are interested in group behavior in public spaces.
Given a set of detected and tracked people, our task is to
associate those people into spatially and temporally coherent
groups, and to detect events describing group behavior.




Figure 1. Description of the proposed video understanding system.



Tracking people, and especially groups of people, in a
relatively unconstrained, cluttered environment is a
challenging task for various reasons. Ge et al. propose
a method to discover small groups of people in a crowd
based on a bottom-up hierarchical clustering approach.
Trajectories of pedestrians are clustered into groups based on
their closeness in terms of distance and velocity. The
experiments of this work were made on videos taken
from a very elevated viewpoint, providing few occlusions.
Haritaoglu et al. [9] detect and track groups of people as
they shop in a store. Their method searches for strongly
connected components in a graph built from the trajectories
of individual people, following the idea that people
belonging to the same group have a lower inter-distance.
This method, however, does not allow group members to
move away and return to the group without being
disconnected from it. Furthermore, the shopping-queue
application lacks genericity (people are rather static and have
a structured behavior), and it is not clear how well this
method adapts to another context of use. Other approaches,
such as [14], aim at detecting specific group-related events
(e.g. queues at vending machines) without tracking. Here
again, the method does not aim at consistently tracking a
group as its dynamics vary. In [10], an algorithm for group
detection and classification as voluntary or involuntary (e.g.
assembled randomly due to lack of space) is proposed. A
top-down camera is used to track individuals, and Voronoi
diagrams are used to quantify the sociological concept of
personal space. No occlusion handling is done in this work,
hence its applicability to other camera viewpoints or to
denser scenes is questionable. Figure 2 shows the result of
our event recognition on tracked groups on a dataset
recorded at the Eindhoven airport.

Figure 2. Group event recognition on Eindhoven airport
sequences. Left: The group is detected as splitting into 2 sub-groups.
Right: Two groups are detected as merging into one.

Event recognition is a key task in the automatic
understanding of video sequences. In this work we are mainly
interested in group events, but the usual techniques can
be applied to different kinds of objects (person, vehicle,
group,...). The typical detection algorithm (figure 1) takes
a video sequence as input and extracts interesting objects
(physical objects). This abstraction stage is the layer
between the image and the semantic worlds. Then, these
objects of interest are used to model events. Finally, the events
are recognized. The abstraction stage determines which
modeling techniques can be applied.
The abstraction technique can be pixel based or object
based [11]. The first kind of technique is not well adapted
to groups: persons belonging to the same group are not
necessarily physically linked. With object abstraction, a
video sequence is represented by the detected objects
(persons, vehicles,...) and their associated properties (speed,
trajectory,...). In the literature, several approaches use this
abstraction level [16], because an activity can be naturally
modeled based on those properties.

Lavee et al. [11] classify existing event modeling
techniques in three categories: pattern recognition
models, state based models and semantic models.
The pattern recognition models are classical recognition
techniques using classifiers such as the nearest neighbor
method, boosting techniques, support vector machines and
neural networks. These techniques are well formalized, but
adding new types of events implies training new
classifiers.

The state based models formalize the events in spatial and
temporal terms using semantic knowledge: Finite State
Machines (FSM), Bayesian Networks (BN), Hidden Markov
Models (HMM), Dynamic Bayesian Networks (DBN)
and Conditional Random Fields (CRF). The HMMs and
all their variants are heavily used for event modeling [6].
They combine the advantages of the FSMs (temporal modeling)
and of the BNs (probabilistic modeling). But due to the
nature of the HMMs (time-sliced structure), complex
temporal relations (e.g. during) are not easily modeled. Lin
et al. [12] propose an asynchronous HMM to recognize
group events. Brdiczka et al. [2] build HMMs
upon conversational hypotheses to model group events
during a meeting. One drawback of the modified HMM
methods is that, since the classical structure of HMMs is
modified, efficient algorithms cannot be applied without
approximation.

The semantic models define spatio-temporal relations
between sub-events to model complex events. Due to the
nature of these models, the events must be defined by an
expert of the application domain. Moreover, these models
are often deterministic. Several techniques are studied:
grammar based models, Petri nets (PN), constraint solving
models and logic based models.

As shown in this section, the quantity of techniques for 
abstraction and event modeling is huge. In this paper, we 
propose a framework (ScReK: Scenario Recognition based 
on Knowledge) to easily model the semantic knowledge of 
the considered application domains: the objects of interest 
and the scenario models, and to recognize events associated 
to the detected group based on spatio-temporal constraints. 

In the rest of this paper, we first describe our technique
to detect and track groups of people (section 2), then we
describe our event detection method applied to tracked groups
(section 3). Section 4 presents evaluations.

2. Group Tracking 

Given a set of detected and tracked people, the proposed 
method focuses mainly on the task of finding associations of 
those people into spatially and temporally coherent groups. 
The human definition of a group is people that know each
other or interact with each other. In fact, according to
McPhail [13]: "Two defining criteria of a group [are]
proximity and/or conversation between two or more persons." It
is quite difficult to directly detect interactions and
conversation between people in a video, or the fact that people
know each other. For automatic recognition we derive this
definition: two or more people who are spatially and
temporally close to each other and have similar direction and
speed of movement, or better: people having similar trajectories.

Group tracking is based on people detection, which can
be performed by various methods. We have compared
several methods and chosen the best one: it is based on the
background subtraction described in [18], selected for
the quality of its results (see table 1 for a comparison of
several methods).

Blobs of foreground pixels are grouped to form physical
objects (also called mobiles), classified into predefined
categories based on the 3D size of objects (using a calibrated
camera): group_of_persons, person and noise. When
people overlap (which happens quite often with a low
viewpoint, such as in figure 6) or are too close to each other,
segmentation fails to split them and they are detected as a single
object classified as group_of_persons because its size is
bigger than the size of a single person. Those classes of
objects are specified using Gaussian functions. Mean, sigma,
min and max values are provided for each class and a score
is computed representing how well an object's dimensions
fit in each category. The category with the best score is
assigned as the class of the object. Detected objects at each
frame are tracked consistently over the long term using a
multiple feature-based tracker [3].
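
The size-based classification described above can be sketched as follows. This is only an illustration under stated assumptions: the class parameters, the use of width and height as the scored dimensions, the product of per-dimension scores, and the fallback to noise when no class fits are all hypothetical choices, not values or rules taken from this paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DimModel:
    """Gaussian model of one 3D dimension, bounded by min and max values."""
    mean: float
    sigma: float
    min: float
    max: float

    def score(self, value: float) -> float:
        # Score in [0, 1]; zero when the value lies outside [min, max].
        if not (self.min <= value <= self.max):
            return 0.0
        z = (value - self.mean) / self.sigma
        return math.exp(-0.5 * z * z)


# Hypothetical parameters in metres, chosen only for illustration.
CLASS_MODELS = {
    "person": {
        "width": DimModel(0.6, 0.2, 0.3, 1.0),
        "height": DimModel(1.7, 0.2, 1.2, 2.2),
    },
    "group_of_persons": {
        "width": DimModel(1.8, 0.6, 1.0, 4.0),
        "height": DimModel(1.7, 0.2, 1.2, 2.2),
    },
}


def classify(dims: dict) -> str:
    """Return the class whose models best fit the measured 3D dimensions."""
    best_label, best_score = "noise", 0.0  # assumption: noise is the fallback
    for label, models in CLASS_MODELS.items():
        score = 1.0
        for name, model in models.items():
            score *= model.score(dims[name])
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With these illustrative parameters, a blob about 2 m wide scores zero for person (it exceeds the maximum width) and is therefore labelled group_of_persons, which mirrors the overlap case discussed above.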

Individual trajectories are the input of the group tracking
algorithm, which is divided into four parts: creation, update,
split/merge and termination. In order to detect temporally
coherent groups, we observe people trajectories over a time
window, denoted delay $T$. In the experiments presented in
section 4 we used $T = 20$ frames. Working at frame $t_c - T$,
$t_c$ being the current frame of the video stream, we cluster
trajectories of individuals between frames $t_c - T$ and $t_c$
to find similar trajectories, representative of groups. We
choose the Mean-Shift clustering algorithm [7] because it does
not require the number of clusters as input. However,
Mean-Shift does require a tolerance parameter determining
the size of the neighborhood for creating clusters. Figure 3
shows the input points and the clustering result.

Figure 3. Example of 5 tracked people clustered into 3 trajectory
clusters. A group is created from cluster 2.

A trajectory is defined as
$Traj = \{(x_i, y_i), i = 0 \ldots T-1\} \cup \{(s_{x_i}, s_{y_i}), i = 1 \ldots T-1\}$,
where $(x_i, y_i), i \in [0; T-1]$ in each trajectory is the
position of the group in the same frames, and
$(s_{x_i}, s_{y_i}) = speed(i-1, i), i \in [1; T-1]$
is the speed of the group between frames $i-1$ and $i$. If $k$
positions on the trajectory are missing because of lacking
detections, we interpolate the $k$ missing positions between
known ones. Each trajectory is a point in a
$2(2T-1)$-dimensional space ($2T$ position coordinates plus
$2(T-1)$ speed coordinates). Mean-Shift is applied on a set of
such points. To make the approach more generic and to be able
to add other features, we normalize the values using
minimum and maximum ranges. The range of positions on the
ground plane is determined by the field of view. The
minimum speed is 0 and the maximum speed is set to 10 m/s,
greatly exceeding all observed values. From the raw value
of $x$, $y$ and $s$ (the speed), denoted by $r \in [min, max]$, we
compute the corresponding normalized value $n \in [0, 1]$ as
$n = \frac{r - min}{max - min}$, where $min$ and $max$ are the
respective minimum and maximum values. We set the tolerance
to 0.1, which groups trajectories distant by less than 10%
of the maximum. This value is quite low because clustering
is used only to group very close people; the case where
people temporarily split is handled by the update step
described below.
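
To make the clustering input concrete, the following sketch builds the normalized trajectory point and clusters a few trajectories. It rests on several assumptions that are not from this paper: a flat-kernel Mean-Shift written from scratch stands in for the actual implementation, the tolerance is used directly as the kernel radius in the normalized space, speed components are kept signed, and `frame_dt`, the helper names and the toy coordinates are hypothetical.

```python
import math


def normalize(r, lo, hi):
    """n = (r - min) / (max - min), as in the text."""
    return (r - lo) / (hi - lo)


def trajectory_features(positions, x_range, y_range, max_speed=10.0, frame_dt=1.0):
    """Turn T ground-plane positions into a 2(2T - 1)-dimensional point."""
    feats = []
    for x, y in positions:
        feats += [normalize(x, *x_range), normalize(y, *y_range)]
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        # Assumption: signed speed components, scaled with min = 0, max = 10 m/s.
        feats += [normalize((x1 - x0) / frame_dt, 0.0, max_speed),
                  normalize((y1 - y0) / frame_dt, 0.0, max_speed)]
    return feats


def mean_shift(points, bandwidth=0.1, max_iter=100, eps=1e-9):
    """Flat-kernel Mean-Shift; returns one cluster label per input point."""
    modes = []
    for p in points:
        mode = list(p)
        for _ in range(max_iter):
            near = [q for q in points if math.dist(mode, q) <= bandwidth]
            if not near:
                break
            shifted = [sum(col) / len(near) for col in zip(*near)]
            moved = math.dist(mode, shifted)
            mode = shifted
            if moved < eps:
                break
        modes.append(mode)
    # Points whose modes converge within one bandwidth share a cluster.
    labels, centers = [], []
    for mode in modes:
        for idx, center in enumerate(centers):
            if math.dist(mode, center) <= bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels


# Toy example with T = 3: two people walking side by side, one far away.
walkers = [
    [(1.0, 1.0), (1.5, 1.0), (2.0, 1.0)],
    [(1.0, 1.5), (1.5, 1.5), (2.0, 1.5)],
    [(15.0, 10.0), (15.0, 10.5), (15.0, 11.0)],
]
points = [trajectory_features(w, (0.0, 20.0), (0.0, 20.0)) for w in walkers]
labels = mean_shift(points)
```

In this toy scene the first two trajectories fall into the same cluster and the third forms its own. Mean-Shift needs no cluster count, only the radius, which is the property the text relies on.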

We characterize a group by three properties: the
average, over the frames in which the group is detected,
of the inter-mobile distance, and the averages over
frames of the standard deviations of speed and direction.
These properties enable the definition of a coherence
criterion: $groupIncoherence = \omega_1 \cdot distanceAvg +
\omega_2 \cdot speedStdDev + \omega_3 \cdot directionStdDev$, where the
weights $\omega_1$, $\omega_2$ and $\omega_3$ are normalization parameters. We
use $\omega_1 = 7$ and $\omega_2 = \omega_3 = 5$ to slightly favor distance
over speed and direction similarity, which are quite noisy.
With this definition, a low value of $groupIncoherence$ is
indicative of a group.
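
As a worked illustration of the coherence criterion, the sketch below computes the three properties and the weighted sum. The input layout (per-frame lists of `(x, y, speed, direction)` tuples), the use of the mean pairwise distance as the inter-mobile distance, the population standard deviation, and the absence of any special handling of angle wrap-around are assumptions made for this example; only the weights 7, 5 and 5 come from the text.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev

WEIGHTS = (7.0, 5.0, 5.0)  # omega_1, omega_2, omega_3 from the text


def group_properties(frames):
    """frames: list of frames, each a list of (x, y, speed, direction)."""
    distances, speed_sd, direction_sd = [], [], []
    for members in frames:
        pairs = list(combinations(members, 2))
        # Assumption: inter-mobile distance = mean pairwise distance.
        distances.append(
            mean([dist(a[:2], b[:2]) for a, b in pairs]) if pairs else 0.0
        )
        speed_sd.append(pstdev([m[2] for m in members]))
        direction_sd.append(pstdev([m[3] for m in members]))
    return mean(distances), mean(speed_sd), mean(direction_sd)


def group_incoherence(frames, weights=WEIGHTS):
    """Weighted sum of the three group properties; low means coherent."""
    distance_avg, speed_std_dev, direction_std_dev = group_properties(frames)
    w1, w2, w3 = weights
    return w1 * distance_avg + w2 * speed_std_dev + w3 * direction_std_dev


# Two members 0.1 apart (normalized units) with identical speed and direction.
pair = [[(0.0, 0.0, 1.0, 0.5), (0.1, 0.0, 1.0, 0.5)]] * 3
# A single object, for which the distance term vanishes.
single = [[(0.0, 0.0, 1.0, 0.5)]]
```

For the `pair` example only the distance term contributes, giving 7 x 0.1; for `single` every term is zero.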

Groups are created from clusters of more than one
physical object. In the case where one group_of_persons
object is detected at frame $t_c - T$, we analyze its trajectory
through the time window. If this object stays the size of a
group, or is close to other objects, we can create a group
and compute its $groupIncoherence$. If the resulting value
is low enough, we keep the created group. In the case of a
single group_of_persons object, the $groupIncoherence$ value
is naturally very low because of a null $distanceAvg$
component. The creation step is made up of these two cases.

Group dynamics vary. Sometimes not all group members
have similar trajectories, for example when the group is
waiting while one member buys a ticket at a vending
machine. Clustering is not enough to correctly update an
existing group in that case. First, we try to associate clusters
with groups existing at the previous frame, using the notion
of probable group of a mobile, defined hereafter. During
tracking, mobiles detected at different frames are connected
by probabilistic links in order to consistently track the same
real objects. We use the terms father and son for the mobiles
in, respectively, the oldest and the most recent frame of the
link. If a father, within a window of $T$ frames, of the mobile
$m$ was in a group $g$ and the link probability between father
and son is above a given threshold (a value of 0.6 is usually
used in the experiments of section 4), then the father's group
$g$ is called the probable group of the mobile $m$: $PG(m) = g$.
Each cluster $c$ is associated with the probable group of most
mobiles in the cluster:
$G(c) = \arg\max_{g \in \{g_i^c\}} |\{g_i^c \mid g_i^c = g\}|$,
where $G(c)$ is the group associated with cluster $c$ and
$\{g_i^c\} = \{PG(m_i^c)\}$ is the set of probable groups of mobiles
belonging to cluster $c$ ($\{m_i^c\}$ being the set of mobiles in
cluster $c$). Several clusters can be associated with the same
group, ensuring that group members having temporarily
diverging trajectories will be kept in the group for a minimal
amount of time. Each mobile $m_i^c$ is added to the group $G(c)$
if this group is really the probable group of the considered
mobile: $PG(m_i^c) = G(c)$. In fact, the update step aims at
tracking existing members of the group and not newcomers.
This procedure is summarized in algorithm 1.
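
The update step can be sketched in a few lines. The data layout is hypothetical: fathers are given as `(group, link_probability)` pairs ordered from most recent to oldest, clusters are lists of mobile identifiers, and ties in the majority vote are resolved by first occurrence. None of these details is specified in the text; only the 0.6 threshold and the majority rule are.

```python
from collections import Counter


def probable_group(fathers, threshold=0.6):
    """Return PG(m) given the fathers of m found within the T-frame window.

    fathers: (group id or None, link probability) pairs, most recent first
    (the ordering is an assumption of this sketch).
    """
    for group, probability in fathers:
        if group is not None and probability > threshold:
            return group
    return None


def update_groups(clusters, pg_of):
    """Associate each cluster with its majority probable group.

    clusters: list of lists of mobile ids; pg_of: mobile id -> PG(m) or None.
    Returns a dict mapping each group id to the set of mobiles added to it.
    """
    added = {}
    for cluster in clusters:
        votes = Counter(pg_of[m] for m in cluster if pg_of[m] is not None)
        if not votes:
            continue  # no known group: such mobiles are left to the creation step
        group = votes.most_common(1)[0][0]  # G(c)
        for m in cluster:
            if pg_of[m] == group:  # only existing members, not newcomers
                added.setdefault(group, set()).add(m)
    return added


pg_of = {
    1: probable_group([("g1", 0.9)]),
    2: probable_group([(None, 0.9), ("g1", 0.7)]),
    3: probable_group([("g2", 0.8)]),
    4: probable_group([("g1", 0.65)]),
    5: probable_group([("g1", 0.4)]),
}
result = update_groups([[1, 2, 3], [4], [5]], pg_of)
```

Here mobile 4 sits in its own cluster but is still attached to g1, which illustrates how several clusters can map to the same group, while mobile 3 is not absorbed into g1 even though it is clustered with two of its members.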

The split of groups operates naturally. When a mobile 
from a group has moved away for too many frames, its prob- 
able group becomes empty and it cannot be added to an ex- 
isting group during the update step, so it splits. It may be 
part of a new group in the creation step, if it gets clustered 
together with other mobiles. 

Two groups $g_1$ and $g_2$ can be merged if two mobiles, one
in each group at frame $t_c - T + k$ ($k \in [0; T-1]$), have the
same son at frame $t_c - T + l$ ($l \in [k+1; T-1]$), meaning
that the two mobiles will merge. The oldest group among
$g_1$ and $g_2$ is kept and all mobiles of the disappearing group
are added into the remaining group.

Algorithm 1: Update of groups.

input : {groups_{t_c-T-1}}, {mobiles_{t_c-T}}
output: updated {groups_{t_c-T}}
{clusters_{t_c-T}} = MeanShift({mobiles_{t_c-T}});
for c ∈ {clusters_{t_c-T}} do
    for m_i^c ∈ {m^c} do
        g_i^c = PG(m_i^c);
    G(c) = argmax_{g ∈ {g^c}} |{g_i^c | g_i^c = g}|;
    for m_i^c ∈ mobiles_{t_c-T} do
        if PG(m_i^c) = G(c) then
            G(c).add(m_i^c);

The group termination step erases old groups. Mobiles
that have been detected at a largely outdated frame (e.g.
$t_c - 5T$) are deleted at frame $t_c - T$ and empty groups
are erased. As a consequence, groups having no new
mobiles for $5T$ frames are erased. All existing groups, even
currently empty ones, can potentially be updated.

Finally, the output of the group tracker, which is the input 
of the event detection, is a set of tracked groups (keeping a 
consistent id through frames) having properties (such as the 
intra-objects distance) and composed of detected physical 
objects at each frame. 

Figure 4. Knowledge modeling for video event recognition.



3. Event Recognition: a Generic Framework 

In this work, a generic framework for event recognition 
is proposed (ScReK). The motivation of this framework is 
to use it within any video understanding application. The 
genericity is obtained in terms of objects of interest and 
event models. We can identify two main parts in an event 
recognition process: the application knowledge (what are 
the expected objects? what are the event models?) and the 
event recognition algorithm. 

Knowledge representation is a key issue for genericity. We
believe that the knowledge should be modeled with the
collaboration of two different categories of people (figure 4):
vision experts (specialists in vision algorithms), and
application domain experts (specialists in the expected events
of their domain). Vision experts model the objects
of interest (detected by vision algorithms) and the video
primitives (properties computed on the detected objects of
interest). Domain experts model the expected application
events.

Usually, for video event recognition, knowledge is
represented using OWL (Web Ontology Language). Even with
tools like Protégé, it is difficult for a non computer
specialist to create her/his own model without a long and
tedious learning of the OWL formalism. The ScReK
framework proposes its own declarative language to easily
describe the application domain knowledge: the ontology.
ScReK proposes a grammar description of the objects and
events using the extended BNF (Backus Naur Form)
representation. An object $O_i$ is described by its parent $O_j$
and its attributes: $O_i = \{a_k\}_{k=0,\ldots,n_i}$. The objects are
defined using an inheritance mechanism: the object $O_i$
inherits all the attributes of its parent $O_j$. The attributes are
described with the help of basic types. 11 basic types are
predefined: boolean, integer, double, timestamp, time interval,
2D point (integer and double), 3D point (integer and double),
and list of 3D points. The user can contribute by adding new
basic types. Moreover, a history of the values of a basic type
is automatically stored. It is useful for vision primitives based
on the evolution of a value in time (e.g. trajectory).
For group behavior recognition, detected group objects 
within the video sequence and scene context objects (zone, 
equipment) are described. The scene context objects help 
to recognize specific events (e.g. by defining a forbidden 
access zone or a threshold). For instance, the class of group 
objects is defined as follows in the ScReK language: 

class Group : Mobile { 
const false ; 
CSInt NumberOfMobiles ; 
CSDouble AverageDistMobiles ; } 

A Group is a Mobile and it inherits all the attributes of a
Mobile object (3D size, 3D position,...). A Group is not
constant (it is dynamic, i.e. its attribute values can change
over time). One of its attributes, NumberOfMobiles, is the
number of objects which compose the group.
The second kind of knowledge to represent is the event
models. They are composed of 6 parts: (1) the type of the
scenario, which can be one of the following: PrimitiveState,
CompositeState, PrimitiveEvent, CompositeEvent, from the
simplest to the most complex events; (2) the name of the event
model, which can be referenced by more complex events;
(3) the list of physical objects (i.e. objects of interest)
involved in the event, whose types depend on the application
domain; (4) the list of components, which contains the
sub-events composing the event model; (5) the list of
constraints on the physical objects or the components, which
can be temporal (between the components) or symbolic (for
physical objects); (6) the alarm information, which describes
the importance of the scenario model in terms of urgency.
Three values are possible, from less urgent to more
urgent: noturgent, urgent, veryurgent.
The alarm level can be used to filter recognized events, for
displaying only important events to the user. Hereafter is a
sample event model:

CompositeEvent(browsing,
    PhysicalObjects((g : Group), (e : Equipment))
    Components((c1 : Group_Stop(g))
               (c2 : Group_Near_Equipment(g, e)))
    Constraints((e->Name = "shop_window"))
    Alarm((Level : URGENT)))

The application domain expert models the event browsing
as "a group is stopped in front of the shop window"
with the model above. The vision expert models the
sub-events Group_Near_Equipment (by measuring the distance
between a group and an equipment) and Group_Stop (by
computing the speed of a group).
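
For readers who prefer a programmatic view, the six parts of an event model can be mirrored in a small data structure. This is a hypothetical Python representation written for illustration, not part of ScReK: the class name, the field names, the upper-case alarm spelling and the validation rules are all assumptions.

```python
from dataclasses import dataclass, field

EVENT_TYPES = ("PrimitiveState", "CompositeState", "PrimitiveEvent", "CompositeEvent")
ALARM_LEVELS = ("NOTURGENT", "URGENT", "VERYURGENT")  # from less to more urgent


@dataclass
class EventModel:
    type: str                 # (1) scenario type
    name: str                 # (2) event model name
    physical_objects: dict    # (3) variable name -> object class
    components: dict = field(default_factory=dict)   # (4) variable -> (sub-event, args)
    constraints: list = field(default_factory=list)  # (5) temporal or symbolic
    alarm: str = "NOTURGENT"  # (6) alarm level

    def __post_init__(self):
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")
        if self.alarm not in ALARM_LEVELS:
            raise ValueError(f"unknown alarm level: {self.alarm}")
        # Every component argument must be a declared physical object.
        for sub_event, args in self.components.values():
            for arg in args:
                if arg not in self.physical_objects:
                    raise ValueError(f"{sub_event} uses undeclared object {arg}")


browsing = EventModel(
    type="CompositeEvent",
    name="browsing",
    physical_objects={"g": "Group", "e": "Equipment"},
    components={
        "c1": ("Group_Stop", ("g",)),
        "c2": ("Group_Near_Equipment", ("g", "e")),
    },
    constraints=['e->Name = "shop_window"'],
    alarm="URGENT",
)
```

The constraint is kept as an opaque string here; a real recognizer would parse and evaluate it against the detected objects.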

The last part of the event recognition framework is the 
recognition algorithm itself. The proposed algorithm solves 
spatio-temporal constraints on the detected groups. The 
usual algorithms to recognize such events can be time con- 
suming. The ScReK framework proposes to define optimal 
event models: at most two components, at most one tempo- 
ral constraint (Allen's algebra) between these components. 
This property is not restrictive since all event models can be 
optimized in this format. Thanks to the optimal property, 
the event model tree is computed. The tree defines which 
sub-event (component) triggers the recognition of which 
event: the sub-event which happens last in time triggers the 
recognition of the global event. For instance, the event A 
has two components B and C with constraint: B before C. 
The recognition of C triggers the recognition of A. The tree
triggers the recognition of only those events that can happen,
decreasing the computation time.
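
The triggering idea can be illustrated with a small sketch. It is a simplification under stated assumptions: only single-component models and the Allen relation "before" are handled, events are represented as `(start, end)` intervals, and the names `build_triggers` and `on_recognized` are hypothetical. The propagation of a newly recognized event to higher-level models is also left out.

```python
from collections import defaultdict


def build_triggers(models):
    """Map each sub-event to the events whose recognition it triggers.

    models: event name -> (components, relation). With the optimal form,
    there are at most two components and one temporal relation.
    """
    triggers = defaultdict(list)
    for event, (components, relation) in models.items():
        if len(components) == 1 or relation == "before":
            triggers[components[-1]].append(event)  # the one that ends last
        else:
            raise ValueError(f"relation not handled in this sketch: {relation}")
    return triggers


def on_recognized(name, interval, history, models, triggers):
    """Record a recognized sub-event and return the events it completes."""
    history[name].append(interval)
    found = []
    for event in triggers.get(name, []):
        components, relation = models[event]
        if len(components) == 1:
            found.append((event, interval))
        elif relation == "before":
            # Look for an earlier instance of the first component.
            for start, end in history[components[0]]:
                if end < interval[0]:
                    found.append((event, (start, interval[1])))
                    break
    return found


models = {"A": (("B", "C"), "before")}
triggers = build_triggers(models)
history = defaultdict(list)
after_b = on_recognized("B", (0, 5), history, models, triggers)
after_c = on_recognized("C", (8, 12), history, models, triggers)
```

Recognizing B alone produces nothing, because B triggers no event; recognizing C then yields A over the interval spanning both components. No model is evaluated unless one of its triggering sub-events has just occurred, which is where the saving in computation comes from.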

The first step of the event recognition process is to recog- 
nize all the possible simple events (most of these events 
are based on the vision primitives) by instantiating all 
the models with the detected objects (e.g. instantiating 
the model Group_Stays_Inside_Zone (takes as in- 
put one group and one zone) for all the detected groups and 
all the zones of the context). The second step consists in 
recognizing complex events according to the event model 
tree and the simple events previously recognized. The final 
step checks if the recognized event at time t has been al- 
ready recognized previously to update the event (end time) 
or create a new one. 
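
The final step (extend an already recognized event or create a new one) can be sketched as below. The record layout and the `max_gap` parameter, which decides how many frames may separate two recognitions of the same event before a new instance is created, are assumptions of this example and are not described in the paper.

```python
def record_event(events, name, objects, t, max_gap=1):
    """Extend the latest matching event up to time t, or create a new one.

    events: list of dicts with keys name, objects, start, end.
    objects: tuple of identifiers of the physical objects involved.
    """
    for event in reversed(events):
        if event["name"] == name and event["objects"] == objects:
            if t - event["end"] <= max_gap:
                event["end"] = t  # same occurrence: update the end time
                return event
            break  # the latest occurrence is too old: start a new one
    event = {"name": name, "objects": objects, "start": t, "end": t}
    events.append(event)
    return event


events = []
for frame in (10, 11, 12, 20):
    record_event(events, "Group_Stop", ("g1",), frame)
```

After these four calls the list holds two instances of the event: one spanning frames 10 to 12 and one starting at frame 20.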

4. Results 

People detection is an input to group detection. We
compared several methods to validate our choice of method.
Table 1 sums up the results of an evaluation done on a
36006-frame sequence (approximately 2 hours of video) in which
37 ground truth (GT) objects (people) have been annotated.
[5] is a feature-based people detector whereas [3] and [18]
both perform motion segmentation and classification of
detected objects. The method C combines the first two
methods for a more robust detection than each one separately.
The method from [18] gives the best results and is used as
input of the group tracking process. This method learns
a background model, resulting in better motion
segmentation and better detection of small objects (far from the
camera) and static objects. The drawback is the time necessary
to learn the model and the low speed of the background
subtraction.

Figure 5. Proposed group event ontology.

Table 1. Comparison of several people detection methods.

Figure 6. Event detection in Turin subway.

We have evaluated the group tracking algorithm
using 4 different views from 3 datasets: videos
recorded for the European project VANAHEIM in the Turin
subway (figure 6), videos recorded for the European project
ViCoMo at the Eindhoven airport (figure 2) and videos
from the benchmarking CAVIAR dataset: the INRIA
entrance hall and the shopping center corridor. In tables 2 and 3
the following metrics are used. The fragmentation metric
computes throughout time how many tracked objects are
associated with one reference object (ground truth data). The
tracking time metric measures the percentage of time during
which a reference object is correctly tracked. The purity
computes the number of reference object IDs per tracked
object. A value close to 1 is indicative of good tracking.
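
One possible reading of these three metrics is sketched below. The exact definitions and normalizations used for tables 2 and 3 are not given in the text, so everything here is an assumption: the input is a per-frame mapping from each reference object to the tracked object matched to it (or `None`), fragmentation and purity are raw counts of distinct identifiers, and tracking time is the fraction of frames in which the reference object has a match.

```python
from collections import defaultdict


def tracking_metrics(frames):
    """frames: list of dicts mapping reference id -> tracked id or None."""
    tracks_per_ref = defaultdict(set)
    refs_per_track = defaultdict(set)
    present = defaultdict(int)
    matched = defaultdict(int)
    for association in frames:
        for ref, track in association.items():
            present[ref] += 1
            if track is not None:
                matched[ref] += 1
                tracks_per_ref[ref].add(track)
                refs_per_track[track].add(ref)
    # Number of tracked objects associated with each reference object.
    fragmentation = {ref: len(tracks_per_ref[ref]) for ref in present}
    # Fraction of its lifetime during which each reference object is matched.
    tracking_time = {ref: matched[ref] / present[ref] for ref in present}
    # Number of reference object IDs seen by each tracked object.
    purity = {track: len(refs) for track, refs in refs_per_track.items()}
    return fragmentation, tracking_time, purity


frames = [{"G1": "t1"}, {"G1": "t1"}, {"G1": None}, {"G1": "t2"}]
fragmentation, tracking_time, purity = tracking_metrics(frames)
```

In this toy sequence the reference group is followed by two different tracks and missed in one frame out of four, so its fragmentation is 2 and its tracking time 0.75, while each track has a purity of 1.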

Table 2 shows group detection and tracking results on
16 sequences from the CAVIAR dataset. The first 9 are
sequences from INRIA, and the remaining ones are from the
shopping center corridor. One can notice that the shopping view
is far more challenging than the hall because more people
are visible and there are more occlusions due to the low
position of the camera. Table 3 contains the results of this
evaluation on 3 annotated sequences (resp. 128, 1373 and
17992 frames) from the Turin subway dataset. In both
tables, detection results are good for almost all sequences. In
the sequence c21sl, ground truth groups at the far end of
the corridor fail to be detected because of the limitations of
the background-subtraction method. Tracking shows good
results with a few exceptions. For instance, sequence 2 of
table 3 contains a main group present in the foreground for
the whole duration of the sequence. This group is correctly
tracked with only one id-switch, but many groups are
annotated far in the background and are difficult to detect for
the motion segmentation algorithm. Their sparse detection
results in many id-switches for group tracking. To the best
of our knowledge, there is no possibility of comparing our
method to an existing one (no public results or code
available).

One major achievement of this paper is an ontology
for group events based on video sensors (figure 5). The
ontology is composed of 49 event models: 45 models
are generic and reusable in any application with groups
(Group_stop, Group_lively,...), and 4 models are specifically
defined for the applications of this paper (these events depend
on the application context: enter_shop,...). The events have
been modeled with the help of metro surveillance staff.

The results of the group event recognition are given in
table 4 for the interesting events. Examples of event
recognition are shown in figures 2, 6 and 7. There are only a few
instances of each event because we only focus on
meaningful group events. The events are correctly recognized with
low false positive and false negative rates. Most of the false
positive detections for the event getting off train are due to
the fact that the door in the foreground is detected as a
person when open. These errors can be corrected by adding a
new video primitive: a door detector.

Table 2. Results of group detection and tracking on 16 CAVIAR
sequences. (Seq - official sequence name, Prec - Precision, Sens
- Sensitivity, Frag - Fragmentation, TT - Tracking Time)

Table 3. Results of group detection and tracking on 3 sequences
from the Turin subway. (Prec - Precision, Sens - Sensitivity, Frag
- Fragmentation, TT - Tracking Time)

Table 4. Group event recognition for the 3 video datasets.




Figure 7. Group event recognition for the CAVIAR sequences.
a. Fighting then splitting. b. Exit from the shop. c. Browsing. d. The
mis-detected group (ghost due to reflections) is browsing.

5. Conclusions 



We propose a generic, plug and play framework for
event recognition from videos: ScReK. The scientific
community can share a common ontology composed of
event models and vision primitives. We demonstrate this
framework on 4 group behavior recognition applications,
using a novel group tracking approach. This approach
gives satisfying results even on very challenging datasets
(numerous occlusions and long duration sequences) such
as in figure 6. The vision primitives are based on global
attributes of groups (position, speed, size). The proposed
event detection approach correctly recognizes events but
shows its limitations for some specific events (e.g. fighting is
best characterized by internal group movement). Adapted
vision primitives, such as optical flow, can address such
specific limitations and are easy to plug into ScReK. Moreover,
in this work the gap between video data and semantic events
is modeled manually by vision experts; the next step is to
learn the vision primitives automatically.